NSF PAR Search | NSF Public Access Repository

From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation

Marashian, Ali; Rice, Enora; Gessler, Luke; Palmer, Alexis; von der Wense, Katharina (January 2025, Association for Computational Linguistics)

Many of the world’s languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.
Full Text Available
Identifying Telescope Usage in Astrophysics Publications: A Machine Learning Framework for Institutional Research Management at Observatories

https://doi.org/10.3847/1538-3881/ad9026

Amado Olivo, Vicente; Kerzendorf, Wolfgang; Cherinka, Brian; Shields, Joshua V; Didier, Annie; von der Wense, Katharina (December 2024, The Astronomical Journal)

Large scientific institutions, such as the Space Telescope Science Institute, track the usage of their facilities to understand the needs of the research community. Astrophysicists incorporate facility usage data into their scientific publications, embedding this information in plain text. Traditional automatic search queries prove unreliable for accurate tracking due to the misidentification of facility names in plain text. As automatic search queries fail, researchers are required to manually classify publications for facility usage, which consumes valuable research time. In this work, we introduce a machine learning classification framework for the automatic identification of facility usage of observation sections in astrophysics publications. Our framework identifies sentences containing telescope mission keywords (e.g., Kepler and TESS) in each publication. Subsequently, the identified sentences are transformed using term frequency–inverse document frequency and classified with a support vector machine. The classification framework leverages the context surrounding the identified telescope mission keywords to provide relevant information to the classifier. The framework successfully classifies the usage of MAST-hosted missions with a 92.9% accuracy. Furthermore, our framework demonstrates robustness when compared to other approaches, considering common metrics and computational complexity. The framework’s interpretability makes it adaptable for use across observatories and other scientific facilities worldwide.
Full Text Available
TAMS: Translation-Assisted Morphological Segmentation

Rice, Enora; Marashian, Ali; Gessler, Luke; Palmer, Alexis; von der Wense, Katharina (August 2024, Association for Computational Linguistics)

Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in endangered language documentation, and NLP systems have the potential to dramatically speed up this process. In typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to train high-quality models. However, translation data is often much more abundant, and, in this work, we present a method that attempts to leverage translation data in the canonical segmentation task. We propose a character-level sequence-to-sequence model that incorporates representations of translations obtained from pretrained high-resource monolingual language models as an additional signal. Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data. Additionally, we find that we can achieve strong performance even without needing difficult-to-obtain word-level alignments. While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
Full Text Available
Aligning to Adults Is Easy, Aligning to Children Is Hard: A Study of Linguistic Alignment in Dialogue Systems

https://doi.org/10.18653/v1/2024.hucllm-1.7

French, Dorothea; D’Mello, Sidney; von der Wense, Katharina (January 2024, ACL)

Full Text Available
It Is Not About What You Say, It Is About How You Say It: A Surprisingly Simple Approach for Improving Reading Comprehension

https://doi.org/10.18653/v1/2024.findings-acl.491

Shaier, Sagi; Hunter, Lawrence; von der Wense, Katharina (January 2024, Association for Computational Linguistics)

Full Text Available
Emerging Challenges in Personalized Medicine: Assessing Demographic Effects on Biomedical Question Answering Systems

Shaier, Sagi; Bennett, Kevin; Hunter, Lawrence; von der Wense, Katharina (August 2022, Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL 2023))

Full Text Available
